Abstract 

In this paper, we explore salient questions about user interests, conversations and friendships in the 
Facebook social network, using a novel latent space model that integrates several data types. A key 
challenge of studying Facebook's data is the wide range of data modalities such as text, network 
links, and categorical labels. Our latent space model seamlessly combines all three data modalities 
over millions of users, allowing us to study the interplay between user friendships, interests, and 
higher-order network-wide social trends on Facebook. The recovered insights not only answer our 
initial questions, but also reveal surprising facts about user interests in the context of Facebook's 
ecosystem. We also confirm that our results are significant with respect to evidential information 
from the study subjects. 



Keywords: Facebook data, user interest visualization, multi-view model, topic model, network 
model 



1 Introduction 



From blogs to social networks to video-sharing sites and still others, online social media have 
grown dramatically over the past half-decade. These media host and aggregate information for 
hundreds of millions of users, and this has sired an unprecedented opportunity to study people 
on an incredible scale, and over a broad spectrum of open problems. In particular, the study of 
user interests, conversations and friendships is of special value to the health of a social network 
ecosystem. As a classic example, if we had a good guess as to what a user likes (say, from explicit 
labels or conversations), we could serve her more appropriate content, which may increase her 
engagement with the media, and potentially help to obtain more structured data about her interests. 
Moreover, by providing content that is relevant to the user and her friends, the social network 
can increase engagement beyond mere individual content consumption — witness the explosive 
success of social games, in which players are rewarded for engaging in game activities with friends, 
as opposed to solitary play. 

These examples illustrate how social networks depend on the interplay between user interests, 
conversations and friendships. In light of this, we seek to answer several questions about Facebook: 

• How does Facebook's social (friendship) graph interact with its interest graph and conversational content? 
Are they correlated? 

• What friendship patterns occur between users with similar interests? 

• Do users with similar interests talk about the same things? 

• How do different interests (say, camping and movies) compare? Do groups of users with distinct interests 
also exhibit different friendship and conversational patterns? 

To answer these questions on the scales dictated by Facebook, it is vital to develop tools that can 
visualize and summarize user information in a salient and aggregated way over large and diverse 
populations of users. In particular, it is critical that these tools enable macroscopic-level study 
of social network phenomena, for there are simply too many individuals to study at fine detail. 
Through the lens of these tools, we can gain an understanding of how user interests, conversations 
and friendships make a social network unique, and how they make it function. In turn, this can 
shape policies aimed at retaining the special character of the network, or at enabling novel utilities 
to drive growth. 

1.1 Key Challenges 

Much research has been invested in user interest prediction [^|4l[T7in31,'3l], particularly methods 
that predict user interests by looking at similar users. However, existing works are mostly built on 
an incomplete view of the social media data, often solely restricted to user texts. In particular, the 
network itself acts as a conduit for information flow among users, and we cannot attain a complete 
view of the social media by ignoring it. Thus, a deep, holistic understanding of user interests and 
of the network as a whole requires a perspective over diverse data modalities (views) such as text, 
network links and categorical labels. To the best of our knowledge, a principled approach that 
enables such capability has yet to be developed. Hence, our goal is to produce such a system for 
understanding the relationships between user interests, conversations and friendships. 

In developing this system, at least two challenges must be properly addressed. For one, the 
data scale is unprecedented — Facebook has hundreds of millions of active users, with diverse 
modalities of information associated with their profiles: textual status updates, comments on other users' 
pages, pictures, and friendships, to name a few. Any method that does not scale linearly in the 
amount of data is bound to fail. The other challenge is the presence of complex structure in 
Facebook's data; its information is not presented as a simple feature vector, but as a cornucopia 
of structured inputs, multimodal in the sense that text, networks, and label data each seemingly 
requires a different approach to learning. Even the text alone cannot be treated as a simple bag of 
words, for it is separated into many comments and posts, with potentially sharp changes of topics 
and intents. One cannot fully model this rich structure with methods that require user data to be 
input as flat feature vectors, or that require a similarity function between them. 

1.2 Solutions 

With these challenges in mind, we present a scalable machine learning system that we use to 
visualize and explore the interests of millions of users on Facebook, and that potentially scales 
to tens or hundreds of millions of users. The key to this system is a unified latent space model 
jointly over text, network and label data, where some of its building blocks have been inspired by 
earlier successful attempts on certain modalities, such as the supervised Latent Dirichlet Allocation 
model over text and labels [6], the Mixed Membership Stochastic Blockmodel over networks 
and the joint text/citation topic models of Nallapati et al. [18]. We call our model the Supervised 
Multi-view Mixed Membership Model (SM^4), which surmounts the multimodal data challenge by 
transforming user text, network and label data into an integrated latent feature vector for each user, 
and overcomes the scalability challenge by first training model parameters on a smaller subset of 
data, after which it infers millions of user feature vectors in parallel. Both the initial training phase 
and the integrated feature vector inference phase require only linear time and a single pass through 
the data. 

Our system's most important function is visualization and exploration, which is achieved by 
deriving other kinds of information from the data in a principled, statistical manner. For instance, 
we can summarize the textual data as collections of related words, known as topics in the topic 
modeling literature [6, 5]. Usually, these topics will be coherent enough that we can assign them 
an intuitive description, e.g. a topic with the words "basketball", "football" and "baseball" is best 
described as a "sports" topic. Next, similar to Blei et al. [6], we can also report the correlation 
between each topic and the label under study — for instance, if we are studying the label "I vote 
Democratic", we would expect topics containing the words "liberal" and "welfare" to be positively 
correlated with said label. The value of this lies in finding unexpected topics that are correlated 
with the label. In fact, we will show that on Facebook, certain well-known brands are positively 
correlated with generic interests such as movies and cooking, while social gaming by contrast is 
negatively correlated. Finally, we can explain each friendship in the social network in terms of 
two topics, one associated with each friend. The motivation behind this last feature is simple: if 
we have two friends who mostly talk about sports, we would naturally guess that their friendship 
is due to mutual interest in sports. In particular, interests with a high degree of mutual interest 
friendships are valuable from a friendship recommendation perspective. As an example, perhaps 
"sports" is highly associated with mutual interest friendships, but not "driving". When ranking 
potential friends for a user who likes sports and driving, we should prefer friends that like sports 
over friends that like driving, as friendships could be more likely to form over sports. 

[Figure 1 graphic: input user data for Users A, B and C (status updates, liked pages, and a "likes movies" or "dislikes movies" label) mapped to an output latent topic space of five topics with keyword lists, movie correlations (Topic 1: +0.8, Topic 2: -0.4, Topic 3: -1.0, Topic 4: +1.0, Topic 5: -0.5), and normalized topic-topic friendship percentages.] 

Figure 1: From user data to latent topic space, and back (best viewed in color). User data in the form 
of text (status updates and like page titles), friendships and interest labels (e.g. likes/dislikes movies) is 
used to learn a latent space of topics. Topics are characterized by a set of weighted keywords, a positive or 
negative correlation with the interest (e.g. +1.0 Movies), and topic-topic friendship probabilities (expressed 
as the percentage of observed friendships, normalized by topic popularity). After learning the topics, we can 
assign the most probable topic to each user word, as well as the most probable topic-pair to each friendship; 
these assignments are represented by word and link colors. Observe that users with lots of green/orange 
words/friendships are likely to be interested in movies, as the corresponding topics (1,4) are detected as 
positive for movies. 

From this latent topical model, we can construct visualizations like Figure 3 that summarize 
all text, network and label data in a single diagram. Using this visualization, we proceed with 
the main application of this paper, a cross-study of four general user interests, namely "camp- 
ing", "cooking", "movies", and "sports". Our goal is to answer the questions posed earlier about 
user interests, conversations and friendships in Facebook, and thus glean insight into what makes 
Facebook unique, and how it functions. We also justify our analyses with quantitative results: 
by training a linear classifier^ on the four interest labels and our system's user feature vectors, 
we demonstrate a statistically significant improvement in prediction accuracy over a bag-of-words 
baseline. 
2 Algorithm Overview 



Our goal is to analyze Facebook user data in the context of a general concept, such as "movies" 
or "cooking". Each Facebook user is associated with three types of data: text such as (but not 
limited to) user "status updates", network links between users based on friendships, and binary 
labels denoting interest in the concept ("I like movies") or lack thereof ("I don't like movies"). 
Intuitively, we want to capture the relationship between concepts, user text and friendships: for 
a given concept, we seek words correlated with interest in that concept (e.g. talking about actors 
may be correlated with interest in movies), as well as words that are most frequently associated 
with each friendship (e.g. we might find two friends that often talk about actors). By learning and 
visualizing such relationships between the input text, network and label data (see Figure 1), we can 
glean insight into the nature of Facebook's social structure. 

Combining text and network data poses special challenges: while text is organized into multiple 
documents per user, networks are instead relational and therefore incompatible with feature-based 
learning algorithms. We solve this using an algorithm that learns a latent feature space over text, 
network and label data, which we call SM^4. The SM^4 algorithm involves the following stages: 

1. Train the SM^4 probabilistic model on a subset of user text, network and label data. This learns 
parameters for a K-dimensional latent feature space over text, network and labels, where each feature 
dimension represents a "topic". 

2. With these parameters, we find the best feature space representations of all users' text, network and 
label data. For each user, we infer a K-dimensional feature vector, representing her tendency towards 
each of the K topics. 

3. The inferred user features have many uses, such as (1) finding which topics are most associated with 
friendships, and (2) training a classifier for predicting user labels. 

The feature space consists of K topics, representing concepts and communities that anchor user 
conversations, friendships and interests. Each topic has three components: a vector of word proba- 
bilities, a vector of friendship probabilities to each of the K topics, and a scalar correlation w.r.t the 
user labels. As an example, we might have a topic with the frequent words "baseball" and "basket- 
ball", where this topic has a high self-friendship probability, as well as a high correlation with the 
positive user label "I like sports". Based on this topic's most frequent words, we might give it the 
name "American sports"; thus, we say that users who often talk about "baseball" and "basketball" 
are talking about "American sports". In addition, the high self-friendship probability of the "Amer- 
ican sports" topic implies that such users are likely to be friends, while the high label correlation 
implies that such users like sports in general. Note that topics can have high friendship probabil- 
ities to other topics, e.g. we might find that "American sports" has a high friendship probability 
with a "Restaurants and bars" topic containing words such as "beer", "grill" and "television". 
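As a minimal sketch of the three per-topic components described above, the following Python fragment stores a topic as a word-probability table, a list of friendship probabilities to each of the K topics, and a scalar label correlation. The class name `Topic`, its field names, and the toy numbers are assumptions made purely for illustration; they are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Topic:
    word_probs: Dict[str, float]  # word -> probability under this topic
    friend_probs: List[float]     # friendship probability to each of the K topics
    label_coef: float             # correlation with the user interest label

    def top_words(self, n: int) -> List[str]:
        # Most probable words first; these are what one would use to name the topic.
        return sorted(self.word_probs, key=self.word_probs.get, reverse=True)[:n]


# Toy two-topic space (hypothetical numbers): topic 0 is "American sports",
# topic 1 is "Restaurants and bars".
sports = Topic({"baseball": 0.5, "basketball": 0.4, "beer": 0.1}, [0.6, 0.3], 0.8)
bars = Topic({"beer": 0.5, "grill": 0.3, "television": 0.2}, [0.3, 0.1], 0.1)
```

Here `sports.friend_probs[0]` plays the role of the self-friendship probability, and `sports.friend_probs[1]` the friendship probability to the second topic.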

3 Supervised Multi-View Mixed Membership Model (SM^4) 

Formally, SM^4 can be described in terms of a probabilistic generative process, whose dependencies 
are summarized in a graphical model representation (Figure 2). Let $P$ be the number of users, $V$ the 
text vocabulary size, and $K$ the desired number of topics. Also let $D_i$ be the number of documents 
for user $i$, and $W_{ik}$ the number of words in user $i$'s $k$-th document. The generative details are 
described below: 

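• Topic parameters: 

  • For the background vocabulary, draw: 

    • $V$-dim. word distribution $\beta_{back} \sim \mathrm{Dirichlet}(\eta)$ 

  • For each topic $a \in \{1, \dots, K\}$, draw: 

    • $V$-dim. topic word distribution $\beta_a \sim \mathrm{Dirichlet}(\eta)$ 

  • For each topic pair $(a, b) \in \{1, \dots, K\}^2$, $a \le b$, draw: 

    • Topic-topic link probability $\Phi_{ab} \sim \mathrm{Beta}(\lambda_1, \lambda_0)$ 

• User features: For each user $i \in \{1, \dots, P\}$, draw: 

  • User feature vector $\theta_i \sim \mathrm{Dirichlet}(\alpha)$ 

• Text: For each user document $(i, k) \in \{1, \dots, P\} \times \{1, \dots, D_i\}$: 

  • Draw document topic $z_{ik} \sim \mathrm{Discrete}(\theta_i)$ 

  • For each word $\ell \in \{1, \dots, W_{ik}\}$, draw: 

    • Foreground-background indicator $f_{ik\ell} \sim \mathrm{Bernoulli}(\delta)$ 

    • Word $w_{ik\ell} \sim \mathrm{Discrete}\big((\beta_{z_{ik}})^{f_{ik\ell}} (\beta_{back})^{1 - f_{ik\ell}}\big)$ 

• Friendship Links: For each $(i, j) \in \mathrm{EdgeList}$, $i < j$, draw: 

  • User $i$'s topic when befriending user $j$, $s_{ij} \sim \mathrm{Discrete}(\theta_i)$ 

  • User $j$'s topic when befriending user $i$, $s_{ji} \sim \mathrm{Discrete}(\theta_j)$ 

  • Link $e_{ij} \sim \mathrm{Bernoulli}(\Phi_{s_{ij}, s_{ji}})$ if $s_{ij} \le s_{ji}$, else $e_{ij} \sim \mathrm{Bernoulli}(\Phi_{s_{ji}, s_{ij}})$ 

• Labels: For each user $i \in \{1, \dots, P\}$, draw: 

  • Label $y_i \sim \mathrm{Normal}(\bar\theta_i^\top \nu, \sigma^2)$, where $\bar\theta_i = \dfrac{\sum_{k=1}^{D_i} z_{ik} + \sum_{j \in \mathrm{Neighbors}(i)} s_{ij}}{D_i + |\mathrm{Neighbors}(i)|}$, with $z_{ik}$ and $s_{ij}$ written as $K$-dim. indicator vectors 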

While this generative process may seem complicated at first glance, we shall argue that each com- 
ponent is necessary for proper modeling of the text, network and label data. Additionally, the 
model's complexity does not entail a high runtime; in fact, our SM^4 algorithm runs in linear 
time with respect to the data, as we will show. 
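To make the generative process above concrete, here is a small forward-sampling sketch in Python. It is an illustration under stated assumptions rather than the authors' code: the function names, the fixed number of documents and words per user, the default hyperparameter values, and the standard-normal draw for the label coefficients (for which the generative process gives no prior) are all hypothetical. It also draws a link outcome for every candidate pair passed in, whereas the model itself only treats positive links explicitly.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    # Symmetric Dirichlet draw via normalized Gamma variates.
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_discrete(rng, probs):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(rng, P, V, K, docs_per_user, words_per_doc, edges,
             alpha=0.5, eta=0.5, delta=0.7, lam1=0.1, lam0=5.0, sigma=0.1):
    beta_back = sample_dirichlet(rng, eta, V)             # background words
    beta = [sample_dirichlet(rng, eta, V) for _ in range(K)]
    phi = {(a, b): rng.betavariate(lam1, lam0)            # upper triangle only
           for a in range(K) for b in range(a, K)}
    nu = [rng.gauss(0.0, 1.0) for _ in range(K)]          # assumption: no prior given
    theta = [sample_dirichlet(rng, alpha, K) for _ in range(P)]
    counts = [[0] * K for _ in range(P)]                  # indicator counts per user

    docs = []
    for i in range(P):
        user_docs = []
        for _ in range(docs_per_user):
            z = sample_discrete(rng, theta[i])            # one topic per document
            counts[i][z] += 1
            words = []
            for _ in range(words_per_doc):
                foreground = rng.random() < delta
                dist = beta[z] if foreground else beta_back
                words.append(sample_discrete(rng, dist))
            user_docs.append((z, words))
        docs.append(user_docs)

    links = {}
    for i, j in edges:
        s_ij = sample_discrete(rng, theta[i])
        s_ji = sample_discrete(rng, theta[j])
        counts[i][s_ij] += 1
        counts[j][s_ji] += 1
        pair = (min(s_ij, s_ji), max(s_ij, s_ji))
        links[(i, j)] = (s_ij, s_ji, rng.random() < phi[pair])

    labels = []
    for i in range(P):
        total = sum(counts[i])                            # documents plus link indicators
        mean = sum(c / total * v for c, v in zip(counts[i], nu))
        labels.append(rng.gauss(mean, sigma))
    return docs, links, labels
```

Calling `generate(random.Random(0), P=4, V=6, K=3, docs_per_user=2, words_per_doc=5, edges=[(0, 1), (2, 3)])` returns toy documents, link draws and real-valued labels for four users.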

Topics and user data  Each user $i$ has 3 data types: text data $w_i$, network links $e_{ij}$, and interest 
labels $y_i \in \{+1, -1\}$. In order to learn salient facts about all 3 data types seamlessly, we introduce 
a latent space feature vector for each user $i$, denoted by $\theta_i = (\theta_{i1}, \dots, \theta_{iK})$. Briefly, a high value 
of $\theta_{ia}$ indicates that user $i$'s text $w_i$, friendship patterns and label $y_i$ are similar to topic $a$. 

Every topic $a \in \{1, \dots, K\}$ is associated with 3 objects: (1) a $V$-dim. word probability vector 
$\beta_a$, (2) link formation probabilities $\Phi_{ab} \in [0, 1]$ to each of the $K$ topics $b$, and (3) a coefficient $\nu_a$ 
that models the linear dependence of labels $y_i$ with topic $a$. The vector $\beta_a$ shows which words are 
most salient for the topic, e.g. a "US politics" topic should have high probabilities on the words 
"Republican" and "Democrat". The link probabilities $\Phi_{ab}$ represent how likely users talking about 
topic $a$ are friends with users talking about topic $b$, e.g. "American sports" having many friendships 
with "Restaurants and bars". Finally, the coefficients $\nu_a$ show the correlation between topic $a$ and 
the user interest labels $y_i$. 

Figure 2: Graphical model representation of SM^4. Tuning parameters are diamonds, latent variables are 
hollow circles, and observed variables are filled circles. Variables pertaining to labels $y_i$ are shown in red. 

Text model  We partition user text data $w_i$ into $D_i$ documents $(w_{i1}, \dots, w_{iD_i})$, where each document 
$ik$ is a vector of $W_{ik}$ words $(w_{ik1}, \dots, w_{ikW_{ik}})$. Each document represents a "status update" by 
the user, or the title of a page she "likes". Compared to other forms of textual data like blogs, 
Facebook documents are very short. Hence, we assume each document corresponds to exactly one 
topic $z_{ik}$, and draw all its words $w_{ik\ell}$ from the topic word distribution $\beta_{z_{ik}}$. This is a notable departure 
from most topic models, which are tailored for longer documents such as academic papers. 

Moreover, Facebook documents contain many keywords irrelevant to the main topic. For example, 
the message "I'm watching football with Jim, enjoying it" is about sports, but the words 
"watching" and "with" are not sports-related. To prevent such generic words from influencing 
topic word distributions $\beta_a$, we introduce per-word foreground-background boolean indicators 
$f_{ik\ell} \sim \mathrm{Bernoulli}(\delta)$, such that we draw $w_{ik\ell}$ from $\beta_{z_{ik}}$ as usual when $f_{ik\ell} = 1$, otherwise we 
draw $w_{ik\ell}$ from a "background" distribution $\beta_{back}$. By relegating irrelevant words to a background 
distribution, we can assign topics to entire documents without diluting the topic word distributions 
with generic words. More generally, the idea of having separate classes of word distributions was 
explored in [20, 12]. 






Network model  Let Neighbors($i$) denote user $i$'s friends, and let EdgeList denote all friendships 
$(i, j)$ for $i < j$. Also, let $e_{ij} \in \{0, 1\}$ be the adjacency matrix of friendships, where $e_{ij} = 1$ implies 
$(i, j) \in \mathrm{EdgeList}$. In our model, friendships arise as follows: first, users $i, j$ draw topics $s_{ij}$ and 
$s_{ji}$ from their feature vectors $\theta_i, \theta_j$. Then, the friendship outcome $e_{ij}$ is generated from $s_{ij}, s_{ji}$; 
this is in contrast to words $w_{ik\ell}$, which are generated from only one topic $z_{ik}$. Specifically, $e_{ij}$ is 
drawn from an upper-triangular $K \times K$ matrix of Bernoulli parameters $\Phi$; we draw $e_{ij}$ from $\Phi_{s_{ij}, s_{ji}}$ 
if $s_{ij} \le s_{ji}$, otherwise we draw from $\Phi_{s_{ji}, s_{ij}}$. Essentially, $\Phi$ describes friendship probabilities 
between topics. 

Because the Facebook network is sparse, we only model positive links; the variables $s_{ij}, s_{ji}, e_{ij}$ 
exist if and only if $e_{ij} = 1$. The zero links $e_{ij} = 0$ are used in a Bayesian fashion: we put a 
$\mathrm{Beta}(\lambda_1, \lambda_0)$ prior on each element of $\Phi$, and set $\lambda_0 = \ln(\#[\text{zero links}] / K^2)$ and $\lambda_1 = 0.1$, where 
$\#[\text{zero links}] = P(P - 1)/2 - |\mathrm{EdgeList}|$. Thus, we account for evidence from zero links without 
explicitly modeling them, which saves a tremendous amount of computation. 
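The prior setting just described is simple arithmetic, and a short Python sketch may help when plugging in numbers. The function name `zero_link_prior` and its argument names are hypothetical; the formula follows the text (log of the zero-link count divided by the squared number of topics, with the other Beta parameter fixed at 0.1).

```python
import math


def zero_link_prior(num_users, num_edges, num_topics, lam1=0.1):
    # Number of user pairs without a friendship: P(P - 1)/2 - |EdgeList|.
    zero_links = num_users * (num_users - 1) // 2 - num_edges
    # Beta prior parameter that stands in for the evidence from zero links.
    lam0 = math.log(zero_links / num_topics ** 2)
    return lam1, lam0
```

For a collection with 40,000 users, 1,292 links and 50 topics, `zero_link_prior(40000, 1292, 50)` gives a second parameter of roughly 12.7, so each topic pair starts with a strong prior pull towards low friendship probability.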

Label model  We extract labels $y_i \in \{+1, -1\}$ from users' "liked" pages, e.g. "music" and 
"cooking". By including labels, we can learn which topics are positively/negatively correlated 
with user interests. Similar to sLDA [6], we draw user labels $y_i \sim \mathrm{Normal}(\bar\theta_i^\top \nu, \sigma^2)$, where $\bar\theta_i$ is the 
average over user $i$'s text topic indicators $z_{ik}$ and network indicators $s_{ij}$ (represented as indicator 
vectors). Put simply, a user's label is a linear regression over her topic vector $\bar\theta_i$. 
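The averaged indicator vector and the resulting label mean can be sketched in a few lines of Python. The helper names `mean_indicator` and `label_mean`, and the toy inputs in the usage note, are assumptions for illustration only.

```python
def mean_indicator(doc_topics, link_topics, num_topics):
    # Average of one-hot indicator vectors over a user's document topics
    # and her link topics; the result sums to 1.
    counts = [0] * num_topics
    for topic in list(doc_topics) + list(link_topics):
        counts[topic] += 1
    total = len(doc_topics) + len(link_topics)
    return [c / total for c in counts]


def label_mean(theta_bar, nu):
    # Mean of the Normal distribution over the user's label: a dot product
    # between the averaged indicators and the per-topic coefficients.
    return sum(t * v for t, v in zip(theta_bar, nu))
```

For a user with document topics `[0, 0, 1]` and a single link topic `[2]`, `mean_indicator` returns `[0.5, 0.25, 0.25]`; with coefficients `[1.0, -1.0, 0.0]` the label mean is `0.25`.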

3.1 Training Algorithm 

Our SM^4 system proceeds in two phases: a training phase to estimate the latent space topic parameters 
$\beta, \Phi, \nu, \sigma^2$ from a smaller subset of users, followed by a parallel prediction phase to estimate 
user feature vectors $\theta_i$ and friendship topic-pair assignments $s_{ij}, s_{ji}$ for each friendship $e_{ij} = 1$. In 
particular, the $s_{ij}, s_{ji}$ provide the most likely "explanation" for each friendship, and this forms a 
cornerstone of our data analysis in Section 6. 

Right now, we shall focus on the details of the training algorithm. Our first step is to simplify 
the training problem by reducing the number of latent variables, through analytic integration of 
user feature vectors $\theta$ and topic word/link parameters $\beta, \Phi$ via Dirichlet-Multinomial and Beta-Binomial 
conjugacy. Hence, the only random variables that remain to be inferred are $z, f, s$ (which 
now depend on the tuning parameters $\alpha, \eta, \delta$). Once $z, f, s$ have been inferred, we can recover the 
topic parameters $\beta, \Phi$ from their values. We also show that our algorithm runs in linear time w.r.t. 
the amount of data, ensuring scalability. 

The training algorithm (Algorithm 1) alternates between Gibbs sampling on $z, f, s$, Metropolis-Hastings on 
tuning parameters $\alpha, \eta, \delta$, and direct maximization of $\nu, \sigma^2$. This hybrid approach is motivated 
by simplicity: Gibbs samplers for models like ours [11] are easier to derive and implement than 
alternatives such as variational inference, while $\alpha, \eta, \delta$ are easily optimized through the 
Metropolis-Hastings algorithm. As for the Gaussian parameters $\nu, \sigma^2$, the high dimensionality of $\nu$ makes 
MCMC convergence difficult, so we resort to a direct maximization strategy similar to sLDA [6]. 
3.1.1 Gibbs sampler for latent variables z, f, s 

Document topic indicators z  A Gibbs sampler samples every latent variable, conditioned on the 
current values of all other variables. We start by deriving the conditional distribution of $z_{ik}$: 

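$$
\begin{aligned}
&P(z_{ik} = m \mid \mathbf{z}_{-ik}, \mathbf{f}, \mathbf{w}, \mathbf{s}, \mathbf{e}, \mathbf{y}) \\
&\quad\propto P(y_i \mid z_{ik} = m, \mathbf{z}_{i,-k}, \mathbf{s}_i)\, P(\mathbf{w}_{ik\cdot} \mid z_{ik} = m, \mathbf{z}_{-ik}, \mathbf{f}, \mathbf{w}_{-ik\cdot}) \\
&\qquad\times P(z_{ik} = m \mid \mathbf{z}_{i,-k}, \mathbf{s}_i) \\
&\quad\propto \exp\left[-\frac{(y_i - \bar\theta_i^\top \nu)^2}{2\sigma^2}\right] \frac{\Gamma\big(V\eta + \sum_{v=1}^{V} A_v\big)}{\prod_{v=1}^{V} \Gamma(\eta + A_v)} \cdot \frac{\prod_{v=1}^{V} \Gamma(\eta + A_v + B_v)}{\Gamma\big(V\eta + \sum_{v=1}^{V} (A_v + B_v)\big)} \\
&\qquad\times \big(\#[\{\mathbf{z}_{i,-k}, \mathbf{s}_i\} = m] + \alpha\big),
\end{aligned}
\tag{1}
$$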

where we use the fact that $P(w_{ik\ell} \mid z_{ik} = m, f_{ik\ell} = 0, \mathbf{z}_{-ik}, \mathbf{w}_{-ik\ell})$ is independent of $z_{ik}$, and where 
we define 

$$
\begin{aligned}
A_v &= |\{(x, y, u) \mid (x, y) \ne (i, k) \wedge f_{xyu} = 1 \wedge z_{xy} = m \wedge w_{xyu} = v\}|, \\
B_v &= |\{u \mid f_{iku} = 1 \wedge w_{iku} = v\}|,
\end{aligned}
$$

where $A_v$ is the number of non-background words $w = v$ assigned to topic $m$ and not belonging 
to user $i$ and document $k$, and $B_v$ is similar but for words belonging to user/document $ik$. Note 
that $\bar\theta_i$ in the exp is a function of $z_{ik}$, and was defined in Section 3. 

The distribution of $z_{ik}$ is composed of a prior term for $z_{ik} = m$ and two posterior terms, one 
for user $i$'s label $y_i$, and one for document $ik$'s words $\mathbf{w}_{ik\cdot}$. The posterior term for $y_i$ is a Gaussian, 
while the posterior term for $\mathbf{w}_{ik\cdot}$ is a Dirichlet Compound Multinomial (DCM) distribution, which 
results from integrating the word distribution $\beta_m$. Notice that background words, i.e. $w_{ik\ell}$ such 
that $f_{ik\ell} = 0$, do not show up in this posterior term. Finally, the $z_{ik}$ prior term is the DCM from 
integrating the feature vector $\theta_i$. 

Importantly, the counts $A_v, B_v$ can be cached and updated in constant time for each $z_{ik}$ being 
sampled, and therefore Eq. (1) can be computed in constant time w.r.t. the number of documents. 
Hence, sampling all z takes linear time in the number of documents. 
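The conditional for a document's topic can be sketched in Python as the product of a Gaussian label term, a DCM word term and a count-based prior term, evaluated in log space for stability. This is a minimal illustration under assumptions: the function and argument names are hypothetical, the cached counts are passed in as plain dictionaries and lists, and the DCM term uses the standard collapsed Dirichlet-multinomial form.

```python
import math


def log_dcm(doc_counts, topic_counts, eta, vocab_size):
    # log-probability of a document's foreground words under one topic,
    # given foreground word counts for that topic from all other documents.
    a_total = sum(topic_counts.values())
    b_total = sum(doc_counts.values())
    out = (math.lgamma(vocab_size * eta + a_total)
           - math.lgamma(vocab_size * eta + a_total + b_total))
    for word, b in doc_counts.items():  # words absent from the doc cancel out
        a = topic_counts.get(word, 0)
        out += math.lgamma(eta + a + b) - math.lgamma(eta + a)
    return out


def z_conditional(doc_counts, topic_word_counts, user_topic_counts,
                  y, nu, sigma2, alpha, eta, vocab_size):
    # user_topic_counts: the user's document and link topic indicators,
    # excluding the document currently being resampled.
    n_other = sum(user_topic_counts)
    log_p = []
    for m in range(len(nu)):
        counts = list(user_topic_counts)
        counts[m] += 1                               # tentatively set the topic to m
        mean = sum(c / (n_other + 1) * v for c, v in zip(counts, nu))
        lp = -(y - mean) ** 2 / (2.0 * sigma2)       # Gaussian label term
        lp += log_dcm(doc_counts, topic_word_counts[m], eta, vocab_size)
        lp += math.log(user_topic_counts[m] + alpha)  # prior term
        log_p.append(lp)
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    total = sum(weights)
    return [w / total for w in weights]
```

A Gibbs step would then draw the new topic from the returned probabilities and update the cached counts accordingly.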

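Word foreground-background indicators f  The conditional distribution of $f_{ik\ell}$ is 

$$
\begin{aligned}
&P(f_{ik\ell} = 1 \mid \mathbf{z}, \mathbf{f}_{-ik\ell}, \mathbf{w}, \mathbf{s}, \mathbf{e}, \mathbf{y}) \\
&\quad= P(w_{ik\ell} \mid \mathbf{z}, f_{ik\ell} = 1, \mathbf{f}_{-ik\ell}, \mathbf{w}_{-ik\ell})\, P(f_{ik\ell} = 1) \\
&\qquad\times \big[P(w_{ik\ell} \mid \mathbf{z}, f_{ik\ell} = 1, \mathbf{f}_{-ik\ell}, \mathbf{w}_{-ik\ell})\, P(f_{ik\ell} = 1) \\
&\qquad\quad+ P(w_{ik\ell} \mid \mathbf{z}, f_{ik\ell} = 0, \mathbf{f}_{-ik\ell}, \mathbf{w}_{-ik\ell})\, P(f_{ik\ell} = 0)\big]^{-1},
\end{aligned}
\tag{2}
$$

with $P(w_{ik\ell} \mid \mathbf{z}, f_{ik\ell} = 1, \mathbf{f}_{-ik\ell}, \mathbf{w}_{-ik\ell}) = \dfrac{\eta + E_{w_{ik\ell}}}{V\eta + \sum_{v} E_v}$, $P(w_{ik\ell} \mid \mathbf{z}, f_{ik\ell} = 0, \mathbf{f}_{-ik\ell}, \mathbf{w}_{-ik\ell}) = \dfrac{\eta + F_{w_{ik\ell}}}{V\eta + \sum_{v} F_v}$, 
$P(f_{ik\ell} = 1) = \delta$, 

where $E_v = |\{(x, y, u) \mid (x, y, u) \ne (i, k, \ell) \wedge f_{xyu} = 1 \wedge z_{xy} = z_{ik} \wedge w_{xyu} = v\}|$ 

and $F_v = |\{(x, y, u) \mid (x, y, u) \ne (i, k, \ell) \wedge f_{xyu} = 0 \wedge w_{xyu} = v\}|$. 
$E_v$ is the number of non-background words $w = v$ assigned to topic $z_{ik}$, excluding $w_{ik\ell}$. $F_v$ is 
similar, but for background words (regardless of topic indicator $z$). 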

Ignoring the normalizer, the distribution of $f_{ik\ell}$ contains a posterior term for $w_{ik\ell}$ and a prior 
term for $f_{ik\ell}$. Again, the $w_{ik\ell}$ term is a DCM; this DCM comes from integrating $\beta_{z_{ik}}$ if $f_{ik\ell} = 1$, 
otherwise it comes from integrating the background word distribution $\beta_{back}$. The $f_{ik\ell}$ prior is a 
simple Bernoulli($\delta$). As with Eq. (1), the counts $E_v, F_v$ can be cached with constant time updates 
per $f_{ik\ell}$, thus sampling all $f$ is linear time in the number of words $w$. 
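As a small sketch of this update, the Python function below computes the probability that a single word is a foreground word. It assumes the standard collapsed Dirichlet-multinomial predictive for each word likelihood, which is consistent with the DCM terms described in the text but is not copied from the authors' code; the function and argument names are hypothetical.

```python
def foreground_prob(e_word, e_total, f_word, f_total, delta, eta, vocab_size):
    # e_word / e_total: cached counts of this word / of all foreground words
    #   in the document's topic, excluding the word being resampled.
    # f_word / f_total: cached counts of this word / of all background words,
    #   excluding the word being resampled.
    fg = delta * (eta + e_word) / (vocab_size * eta + e_total)
    bg = (1.0 - delta) * (eta + f_word) / (vocab_size * eta + f_total)
    return fg / (fg + bg)
```

When the topic and background counts are identical the function falls back to the prior probability: `foreground_prob(3, 10, 3, 10, 0.7, 0.1, 5)` is 0.7 up to floating-point rounding.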



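Link topic indicators s  Recall that we only model $s_{ij}, s_{ji}, e_{ij}$ for positive links $e_{ij} = 1$. For 
convenience, let $e_{ji} = e_{ij}$ for all $i < j$. The resulting conditional distribution of $s_{ij}$ is 

$$
\begin{aligned}
&P(s_{ij} = m \mid \mathbf{z}, \mathbf{f}, \mathbf{w}, \mathbf{s}_{-ij}, e_{ij} = 1, \mathbf{e}_{-ij}, \mathbf{y}) \\
&\quad\propto P(y_i \mid \mathbf{z}_i, \mathbf{s}_{i,-j}, s_{ij} = m)\, P(e_{ij} = 1 \mid s_{ij} = m, s_{ji}, \mathbf{s}_{-\{ij, ji\}}, \mathbf{e}_{-ij}) \\
&\qquad\times P(s_{ij} = m \mid \mathbf{z}_i, \mathbf{s}_{i,-j}) \\
&\quad\propto \exp\left[-\frac{(y_i - \bar\theta_i^\top \nu)^2}{2\sigma^2}\right] \frac{C + \lambda_1}{C + \lambda_1 + \lambda_0} \times \big(\#[\{\mathbf{z}_i, \mathbf{s}_{i,-j}\} = m] + \alpha\big),
\end{aligned}
\tag{3}
$$

where 

$$
C = \begin{cases}
|\{(x, y) \in \mathrm{EdgeList} \mid (x, y) \ne (i, j) \wedge [(s_{xy}, s_{yx}) = (m, s_{ji}) \vee (s_{xy}, s_{yx}) = (s_{ji}, m)]\}| & \text{if } i < j, \\
|\{(y, x) \in \mathrm{EdgeList} \mid (x, y) \ne (i, j) \wedge [(s_{xy}, s_{yx}) = (m, s_{ji}) \vee (s_{xy}, s_{yx}) = (s_{ji}, m)]\}| & \text{if } i > j.
\end{cases}
$$

$C$ is the number of positive links $\mathbf{e} \setminus e_{ij}$ whose topic indicators $(s_{xy}, s_{yx})$ are identical to the topics 
$(s_{ij}, s_{ji})$ of $e_{ij}$. The OR clauses simply take care of situations where $s_{xy} > s_{yx}$ and/or $s_{ij} > s_{ji}$. 
The distribution of $s_{ij}$ contains a prior term for $s_{ij} = m$ (the DCM from integrating $\theta_i$), a Gaussian 
posterior term for $y_i$, and a link posterior term for $e_{ij}$ (the Beta Compound Bernoulli distribution 
from integrating out the link probability $\Phi_{m, s_{ji}}$). 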

Like Eqs. (1) and (2), $C$ can be cached using constant time updates per $s_{ij}$, thus sampling all $s$ is 
linear in the number of friendships $|\mathrm{EdgeList}|$. Combined with the constant time sampling for Eqs. (1) and (2), 
we see that the SM^4 algorithm requires linear time in the amount of data. 
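The constant-time caching of link counts can be illustrated with a small Python class. This is a sketch under assumptions: the class and method names are hypothetical, topic pairs are stored in sorted order to mirror the upper-triangular layout, and the link term is taken to be the usual Beta posterior mean with only positive links observed, which may differ in detail from the authors' implementation.

```python
from collections import Counter


class LinkCounts:
    """Cache of positive-link counts per unordered topic pair."""

    def __init__(self):
        self.counts = Counter()

    @staticmethod
    def key(a, b):
        # Store each pair with the smaller topic index first.
        return (a, b) if a <= b else (b, a)

    def add(self, a, b):
        self.counts[self.key(a, b)] += 1

    def remove(self, a, b):
        self.counts[self.key(a, b)] -= 1

    def link_term(self, a, b, lam1, lam0):
        # Beta compound Bernoulli predictive for a positive link,
        # given the cached count of other links with the same topic pair.
        c = self.counts[self.key(a, b)]
        return (c + lam1) / (c + lam1 + lam0)
```

During a sweep, one would call `remove` for the link being resampled, evaluate `link_term` for each candidate topic, and then call `add` with the newly drawn topic, so each update touches a single counter.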



3.1.2 Learning tuning parameters $\alpha, \eta, \delta$ and $\nu, \sigma^2$ 

We automatically learn the best tuning parameters $\alpha, \eta, \delta$ using Independence Chain Metropolis-Hastings, 
by assuming $\alpha, \eta$ are drawn from Exponential(1), while $\delta$ is drawn from Beta(1, 1). 
For $\nu, \sigma^2$, we take a Stochastic Expectation-Maximization [10] approach, in which we maximize 
the log-likelihood with respect to $\nu, \sigma^2$ based on the current Gibbs sampler values of $z, s$. The 
maximization has a closed-form solution similar to sLDA [6], but without the expectations: 

$$
\nu \leftarrow (A^\top A)^{-1} A^\top b, \qquad \sigma^2 \leftarrow \frac{1}{P} (b - A\nu)^\top (b - A\nu), \tag{4}
$$

where $A$ is a $P \times K$ matrix whose $i$-th row is the current Gibbs sample of $\bar\theta_i$, and $b$ is a $P$-vector 
of user labels $y_i$. 
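The closed-form maximization amounts to ordinary least squares on the sampled indicator averages, followed by the mean squared residual. The Python sketch below illustrates this with a hand-written solver so that it needs no external libraries; the function names are hypothetical, and it assumes the K x K normal-equation matrix is invertible (no pivot is exactly zero).

```python
def solve_linear(matrix, rhs):
    # Gauss-Jordan elimination with partial pivoting on a small dense system.
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_label_regression(A, b):
    # A: one row of averaged topic indicators per user; b: user labels.
    P, K = len(A), len(A[0])
    ata = [[sum(row[r] * row[c] for row in A) for c in range(K)] for r in range(K)]
    atb = [sum(row[r] * y for row, y in zip(A, b)) for r in range(K)]
    nu = solve_linear(ata, atb)                       # least-squares coefficients
    residuals = [y - sum(t * v for t, v in zip(row, nu)) for row, y in zip(A, b)]
    sigma2 = sum(e * e for e in residuals) / P        # mean squared residual
    return nu, sigma2
```

With rows `[[1, 0], [0, 1], [0.5, 0.5]]` and labels `[1, -1, 0]`, the fit is exact: the coefficients come out as `[1, -1]` and the variance as zero, up to rounding.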
Algorithm 1 SM^4 Training Algorithm 

1: Input: Training user text data $w$, links $e$ and labels $y$ 

2: Randomly initialize $z, f, s$ and parameters $\alpha, \eta, \delta, \nu, \sigma^2$ 

3: Set $\lambda_1, \lambda_0$ according to Section 3, Network model 

4: repeat 

5:   Gibbs sample all $z, f, s$ using Eqs. (1), (2), (3) 

6:   Run Metropolis-Hastings on tuning parameters $\alpha, \eta, \delta$ 

7:   Maximize parameters $\nu, \sigma^2$ using Eq. (4) 

8: until Iteration limit or convergence 

9: Output: Sufficient statistics for $z, f, s$, and all parameters $\alpha, \eta, \delta, \lambda_1, \lambda_0, \nu, \sigma^2$ 



Algorithm 2 SM^4 Parallelizable Prediction Algorithm 

1: Input: Parameters $\beta, \Phi, \alpha, \delta, \nu, \sigma^2$ from training phase 

2: Input: Test user $p$'s text data $w_{p\cdot\cdot}$ 

3: Randomly initialize $z_{p\cdot}, f_{p\cdot\cdot}$ for the test user 

4: repeat 

5:   Gibbs sample $z_{p\cdot}$ using Eq. (1), and $f_{p\cdot\cdot}$ using Eq. (2) 

6: until Iteration limit or convergence 

7: Estimate test user's feature vector $\theta_p$ from her $z_{p\cdot}$ 

8: Use $\theta_p$ to predict $s_{pj}, s_{jp}$ for all friends $j$ 

9: Output: Test user's $\theta_p, s_{pj}, s_{jp}$ 



Updating all parameters $\alpha, \eta, \delta, \nu, \sigma^2$ requires linear time in the amount of data, so we update 
them once per Gibbs sampler sweep over all latent variables $z, f, s$. This ensures that every iteration 
(Gibbs sweep plus parameter update) takes linear time. 

3.2 Parallelizable Prediction Algorithm 

Our training algorithm learns topic parameters $\beta, \Phi, \nu, \sigma^2$, so that we can use our Prediction 
Algorithm (Algorithm 2) to predict feature vectors $\theta_p$ and friendship topic-pair assignments $s_{pj}, s_{jp}$ for all users 
$p$. For each user $p$ independently and in parallel, we Gibbs sample her text latent variables $z_{p\cdot}, f_{p\cdot\cdot}$ 
based on her observed documents $w_{p\cdot\cdot}$ and the learnt parameters $\beta, \Phi, \nu, \sigma^2$. Then, using the 
definition of our SM^4 generative process, we estimate $p$'s feature vector $\theta_p$ by averaging over her $z_{p\cdot}$. 
Finally, we use $\theta_p$ and the learnt topic parameters $\Phi$ to predict $p$'s most likely friendship topic-pair 
assignments $s^*_{pj}, s^*_{jp}$ to each of her friends $j$, using this equation: 

$$
(s^*_{pj}, s^*_{jp}) = \arg\max_{(a, b)\ \text{s.t.}\ a \le b} \theta_{p,a}\, \Phi_{a,b}\, \theta_{j,b}. \tag{5}
$$

We use these assignments to discover the topics that friendships are most frequently associated 
with. Like the training algorithm, the Prediction Algorithm also runs in linear time. 
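The topic-pair prediction is a small exhaustive search, sketched below in Python. The function name and the list-of-lists layout for the link probabilities are assumptions for illustration; the search is restricted to the upper triangle including the diagonal, which is how the constraint in the prediction equation is read here.

```python
def best_topic_pair(theta_p, theta_j, phi):
    # theta_p, theta_j: feature vectors of user p and her friend j.
    # phi[a][b]: topic-topic link probability, only read for a <= b.
    num_topics = len(theta_p)
    best_pair, best_score = None, -1.0
    for a in range(num_topics):
        for b in range(a, num_topics):
            score = theta_p[a] * phi[a][b] * theta_j[b]
            if score > best_score:
                best_pair, best_score = (a, b), score
    return best_pair
```

For example, with `theta_p = [0.7, 0.3]`, `theta_j = [0.2, 0.8]` and `phi = [[0.5, 0.1], [0.0, 0.4]]`, the candidate scores are 0.07, 0.056 and 0.096, so the function returns `(1, 1)`. The search costs on the order of K squared operations per friendship, which is constant for fixed K.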
4 Experimental setting 



Our goal is to analyze Facebook users in the context of their interests, friendships and conversa- 
tions. Facebook users typically express interests such as "movies" or "cooking" by establishing a 
"like" relation with the corresponding Facebook pages, and our experiments focus on four popu- 
lar user interests in Facebook: camping, cooking, movies and sports. We selected these concepts 
because of their broad scope: not only are they generic concepts, but each of their pages was associated 
with more than 5 million likes as of May 2011, ensuring a sufficiently large user base for 
data collection. For each interest C, we collected our data as follows: 

1. Construct the complete data collection $S(C)$ by randomly selecting 1 million users who like interest 
$C$ ($y_i = +1$), and 1 million who do not explicitly mention liking $C$ ($y_i = -1$). 

2. For each user $i \in S(C)$, collect the following data^: 

• User text documents $w_{ik\cdot}$: The text documents for user $i$ contain all of her "status updates" from 
March 1st to 7th, 2011 (each status update is one document), as well as titles of Facebook pages that 
she likes by March 7th 2011 (each page title is one document^). We preprocessed all documents 
using typical NLP techniques, such as stopword removal, stemming, and collocation identification. 

• User-to-user friendships: We obtained these symmetric friendships using the friend lists of user $i$ 
recorded on March 7th 2011. 

3. Randomly sample 2% of $S(C)$ to construct a 40,000-user training collection. Across the four 
concepts, the training collection contained 340,128 to 385,091 unique words, 6,650,335 to 8,771,298 documents, 
16,421,601 to 22,521,507 words, and 1,292 to 2,514 links^. 

We first trained the SM^4 model using the training collection S'(C) and X = 50 latent features
(topics), stopping our Gibbs sampler at the 100th iteration because 1) the per-iteration increase
in log-likelihood was < 1% of the cumulative increase, and 2) more iterations had negligible
impact on our validation experiments. This process required 24 hours for each concept, using one
computational thread. We note that one could subsample larger training collections S'(C), thus
increasing the accuracy of parameter learning at the expense of increased training time. A recently
introduced alternative is to apply approximate parallel inference techniques such as distributed
Gibbs sampling [16, 2], but these introduce synchronization and convergence issues that are not
yet fully understood.
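
The first stopping criterion can be written down as a small check on the log-likelihood trace. The sketch below is one reading of that rule, not the code used in our experiments; the function name should_stop and the convention that the trace starts with the log-likelihood at initialization are assumptions.

```python
def should_stop(loglik_trace, threshold=0.01):
    """Return True once the latest per-iteration gain in log-likelihood is
    below `threshold` times the cumulative gain since initialization.

    loglik_trace[0] is assumed to be the log-likelihood before any sweep.
    """
    if len(loglik_trace) < 3:
        return False  # too few iterations to compare gains
    cumulative_gain = loglik_trace[-1] - loglik_trace[0]
    latest_gain = loglik_trace[-1] - loglik_trace[-2]
    if cumulative_gain <= 0:
        return False  # the sampler has not improved on its starting point
    return latest_gain < threshold * cumulative_gain


if __name__ == "__main__":
    trace = [-1000.0, -500.0, -300.0, -298.0]
    print(should_stop(trace[:3]), should_stop(trace))
```

The second criterion (negligible impact on validation accuracy) requires rerunning the downstream experiments and is not captured by this check.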

After learning topic parameters from the training collection S'(C), we invoke Algorithm 2
on all users p ∈ S(C) to obtain their predicted feature vectors θ_p, and the friendship topic-pair
"explanations" s_pj, s_jp for each of p's friends j. Note that Algorithm 2 is parallelizable over every
user in S(C), and we observe that it only requires a few minutes per user; a sufficiently large cluster
finishes all 2M users in a single day. In fact, given enough computing power, it is possible to
scale our prediction to all of Facebook. In the following sections, we shall apply the predicted
θ_p, s_pj, s_jp to various analyses of Facebook's data.

^We use only non-private user data for our experiments; for example, chat logs and user messages are never examined.

^We remove the page title of concept C, because its distribution is highly correlated with the labels. 

^The relatively small number of links arises from unbiased random sampling of users; more links can be obtained 
by starting with a seed set of users and picking their friends, but this introduces bias. Also, our method uses evidence 
from negative links, so the small number of positive links is not necessarily a drawback. 
5 Validation



Before interpreting our results, we must validate the performance of our SM^4 model and algorithm.
Because our model spans multiple data modalities, there is arguably no single task or metric that
can evaluate all aspects of SM^4. Instead, we test how well the SM^4 latent space and
feature vectors predict held-out user interest labels from our data collections S(C). We believe
this is the best task for two reasons. First, we are concerned with interpreting user interests
in the context of friendships and conversations, so we must show that the SM^4 latent space
accurately captures user interests. Second, predicting user interests is a simple and well-established
task, and its results are therefore easier to interpret than model goodness-of-fit measures
such as perplexity (as used in [7]).

It is well understood that textual latent space methods like Latent Dirichlet Allocation (LDA),
while useful for summarization and visualization, normally do not improve classification accuracy.
In fact, with large amounts of training data, they may actually perform worse than a naive
Bag-of-Words (BoW) representation [7]. This stems from the fact that latent space methods are
dimensionality reduction techniques, and thus distort the data by necessity. In our case, the picture
is more complicated: the text aspect of our model loses information with respect to BoW, yet some
non-textual information comes into play from the friendship links and labels in the small training
collections S'(C). We believe the best way to use SM^4 is to concatenate SM^4 features to the BoW
features: this avoids the information loss from reducing the dimensionality of the text, while
allowing the network and label information to come into play. We expect this to yield a modest
(but statistically significant) improvement in accuracy over a plain BoW baseline.

Our task setup is as follows: recall that for each interest C, we obtained a 2M-user data collection
S(C) with ground truth interest labels y_i for all users. The SM^4 algorithm predicts feature vectors
θ_p for all users p ∈ S(C), which can be exploited to learn a linear Support Vector Machine (SVM)
classifier for the labels y_i. More specifically, we use θ_p concatenated with user p's original BoW
as feature inputs to LIBLINEAR [9], and then perform 10-fold cross-validation experiments on
the labels y_i. This was done for each of the four data collections S(C), and each experiment took
< 1 hour. As a baseline, we compare to LIBLINEAR trained on BoW features only. The BoW
features for user p are just the normalized word frequencies over all her documents.
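
The feature construction and fold assignment described above can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, the normalization is assumed to make the frequencies sum to one (the text does not say which norm is used), and the classifier itself (LIBLINEAR) is deliberately left out.

```python
from collections import Counter


def bow_features(documents, vocabulary):
    """Normalized word frequencies over all of one user's documents.

    Assumption: frequencies are scaled to sum to 1 over the vocabulary.
    """
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


def combined_features(documents, vocabulary, theta):
    """BoW features with the user's predicted topic vector theta_p appended."""
    return bow_features(documents, vocabulary) + list(theta)


def k_fold_indices(n_users, k=10):
    """Yield (train, test) index lists, assigning users to folds round-robin.

    Users are assumed to be in random order already.
    """
    for fold in range(k):
        test = [i for i in range(n_users) if i % k == fold]
        train = [i for i in range(n_users) if i % k != fold]
        yield train, test


if __name__ == "__main__":
    docs = [["camp", "fish"], ["camp"]]
    vocab = ["camp", "fish", "movie"]
    print(combined_features(docs, vocab, theta=[0.25, 0.75]))
```

The BoW-only baseline corresponds to calling bow_features alone, so the two settings differ only in the appended topic coordinates.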

Table 1 summarizes our results. To determine if the improvement from SM^4 is statistically
significant, we conducted a χ²-test (one degree of freedom, 2M trials) against the BoW Baseline
as a null hypothesis. The p-values are far below 0.001, suggesting that the improvement provided
by SM^4 features is statistically very significant. This confirms our hypothesis that the SM^4 features
improve classification accuracy, by virtue of encoding network and label information from the
small training collections S'(C). We expect that classification accuracy will only increase with
larger training collections S'(C), albeit at the expense of more computation time.
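
One plausible reading of this test is a one-sample goodness-of-fit test in which the baseline accuracy p0 plays the role of the null success probability over n = 2M trials, giving the statistic n (p - p0)^2 / (p0 (1 - p0)). The sketch below implements that reading; the function names are hypothetical, and the exact form of the test is an assumption rather than something stated in the text.

```python
import math


def chi_square_vs_baseline(acc_baseline, acc_model, n_trials):
    """Chi-square goodness-of-fit statistic (1 degree of freedom) for observing
    accuracy acc_model over n_trials when the null accuracy is acc_baseline."""
    diff = acc_model - acc_baseline
    return n_trials * diff * diff / (acc_baseline * (1.0 - acc_baseline))


def p_value_one_dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Sports accuracies from Table 1, expressed as fractions.
    stat = chi_square_vs_baseline(0.7891, 0.8023, 2_000_000)
    print(round(stat), p_value_one_dof(stat) < 0.001)
```

For the sports accuracies this gives a statistic of roughly 2.1 × 10^3, in line with the leading digits reported in Table 1; the other three interests behave the same way under this reading.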

6 Understanding User Interests and Friendships in Facebook 

In the introduction, we posed four questions about Facebook: 

• How does Facebook's social (friendship) graph interact with its interest graph and conversational content?
Are they correlated?

• What friendship patterns occur between users with similar interests?

• Do users with similar interests talk about the same things?

• How do different interests (say, camping and movies) compare? Do groups of users with distinct interests
also exhibit different friendship and conversational patterns?

[Figure 3: text recovered from the figure. Each topic is listed by its heading (popularity and interest
correlation) followed by its top words, grouped by position in the figure.]

Camping (corner)

• Ca (4.9%) +2.48: country_music, farm, fish, george_strait, hunt, jason_aldean, mafia, park, texas_hold_em
• Ca (0.8%) -1.15: bath, body, elementary_school, jail, live, net, sir, st_louis, tire, work
• Ca (1.6%) -0.23: conservative, facebook, funer, military, project, soldier, stop, support, walt_disney,
walt_disney_world

Cooking (corner)

• Co (1.8%) -1.06: bet, children, fan, fanatic, glee, million, nation, pants_on_the_ground, pizza, usa
• Co (1.1%) -0.65: art, beauty, boutique, italian, olive_garden, photography, restaurant, salon, studio

Facebook fanpages (middle)

• Ca (11.5%) +0.92, Co (12.2%) +0.61, Mo (12.5%) +1.04, Sp (14.7%) +1.25: adam_sandler, basketball, disney,
drake, dr_pepper, family_guy, fresh_prince, hangover, harry_potter, movie, mtv, nicki_minaj, oreo, simpsons,
starbucks, starburst, subway, skittles, twilight, youtube
• Ca (1.7%) -0.93, Co (1.4%) -0.91, Mo (1.3%) -1.02, Sp (1.4%) -0.63: bet, click, cop, dear, fan, find, glug,
head, italian, justin_bieber, law_order, math, million, movie, office, org, page, problem, reach, solve, sorry,
sound, stay_up_late, strong, sure, the_big_bang_theory, twilight
• Ca (1.9%) -0.59, Co (2.0%) -0.88, Mo (1.4%) -1.01, Sp (0.8%) -1.10: cash, chicken_coop, farm, farmville,
flower, free, girl, group, hip_hop, join, kelloggs, mafia_wars, pop_tart, progress, work, zynga

Informal conversation in status updates (middle)

• Ca (3.1%) +1.43, Co (29.9%) +6.24, Mo (29.6%) +5.17, Sp (0.9%) -0.72: buddy, care, don't, elf, find, friend,
hear, homework, hurt, mad, mean, person, say, smile, sorry, stop, stupid, talk, text, truth
• Ca (5.8%) -0.86, Co (4.6%) -0.93, Mo (6.6%) -0.42, Sp (12.4%) +2.65: beauty, boy, boyfriend, cute, dude,
friend, girl, girlfriend, guy, hot, mean, play, say, text, treat, ugly
• Ca (1.2%) -0.31, Co (0.8%) -0.30, Mo (0.8%) -0.80, Sp (1.1%) -0.49: annoy, answer, ask, dad, food, found,
friend, hey, house, look, mom, mommy, mother, nevermind, smell, sex, slut, ye, yeah

Movies (corner)

• Mo (8.3%) +1.73: ac_dc, beatles, family_guy, gummy_worm, history, metallica, music, simpsons, south_park, sour
• Mo (0.6%) +1.64: end, epic, find, listen, money, movie, music, pocket, sex, won
• Mo (1.3%) -0.39: art, design, music, photography, product, spell, studio, taught, toy, usa
• Mo (1.9%) -0.62: bag, bar, center, family, food, grill, red_hot_chili_peppers, restaurant, sport

Sports (corner)

• Sp (0.8%) +3.04: basketball, call_of_duty, dance, footbal, listen, music, nike, play, soccer, show
• (heading and leading words not recovered): greys_anatomy, espn, sportscenter,
the_secret_life_of_the_american_teenager
• Sp (0.7%) -0.06: camp, farm, fish, grill, hunt, mafia, olive_garden, park, pizza, restaurant
• Sp (1.2%) -0.94: art, blue, center, design, kim_kardashian, lie, photography, soulja_boy_tell_em

Edge labels (friendship proportions; the topic pairs they connect are not recoverable): 4.5%, 4.6%, 11.2%,
5.2%, 4.3%, 21.9%, 13.3%, 12.6%, 5.9%, 4.4%, 13.2%

Figure 3: A visual summary of the relationship between Facebook friendships, user conversations, and 4 types of
user interests (best viewed in color). Topics specific to a particular interest are found in the corners, while common
topics are found in the middle, divided into topics containing Facebook fanpage titles or status update lingo. Note
that we manually introduced this distinction for the sake of visualization; the SM^4 algorithm discovers all topics purely
from the data. Thick borders highlight topics positively correlated with user interests, while dashed borders highlight
negative correlation. Font colors highlight information relevant to a specific interest: blue for camping (Ca), red for
cooking (Co), green for movies (Mo), and purple for sports (Sp). The colored heading in each topic describes its
popularity, and its correlation with user interests: for example, "Ca (4.9%) +2.48" means this topic accounts for 4.9%
of user text in the camping dataset, and has a moderate positive correlation with interest in camping. Finally, an edge
between a pair of topics shows the proportion of friendships attributed to that pair (normalized by topic popularity).

Table 1: User interest classification accuracy (in percent) under a 10-fold cross-validation setup, for a Bag-of-Words
baseline, and BoW plus SM^4 feature vectors. Each experiment is performed over 2 million users.
We also report χ²-statistics and p-values (1 degree of freedom), which show that adding SM^4 features yields
a highly significant improvement in accuracy.

Features        Sports        Movies        Camping       Cooking
BoW Baseline    78.91         78.51         79.85         77.22
Plus SM^4       80.23         80.48         81.08         78.57
χ²-statistic    2.1 × 10^3    4.6 × 10^3    1.9 × 10^3    2.1 × 10^3
p-value         < 0.001       < 0.001       < 0.001       < 0.001

We shall answer these questions by analyzing our SM^4 output over the four user interests: camping,
cooking, movies and sports. Such analysis is not only useful for content recommendation, but 
can also inform policies targeted at increasing connectivity (making more friends) and interaction 
(having more conversations) within the social network. Through continuous study of user interests, 
conversations and friendships, we hope to learn what makes the social network unique, and what 
must be done to grow it. 

6.1 Visualization procedure 

In Figure 3, we combine SM^4's output over all four user interests into one holistic visualization;
this section describes how we constructed that visualization. First, recall
that for each interest C, our SM^4 system learns topic parameters from a training subset S'(C) of
user text documents, friendship links, and labels. These parameters are then used to infer various
facts about the full user dataset S(C): (1) user feature vectors θ_p that give their propensities towards
various topics, and (2) each friendship's most likely topic-pair assignments s_ij, s_ji, which reveal
the topics a given pair of friends is most likely to talk about.

With these learnt parameters, we search for the 6 most strongly-recurring topics across all
four interests, as measured by cosine similarity. These topics, shown in the middle of Figure
3, represent commonly-used words on Facebook, and provide a common theme that unites the
four user interests. Next, for each interest, we search for the top 4 topic-pairs (including pairs of
the same topic) with the highest friendship counts (which come from the topic-pair assignments
s_ij, s_ji). Note that we first normalize each topic-pair friendship count by the popularity^ of both
topics, in order to avoid selecting popular but low-friendship topics. We show these 4 topic-pairs in
the corners of Figure 3, along with their normalized friendship counts. These topic-pairs represent
conversations between friends; more importantly, if the topics are also positively correlated with
the user interest (say, camping), then they reveal what friends who like camping actually talk
about. This context-specificity is especially valuable for separating generic chatter from genuine
conversation about an interest.

^The sum of a topic's weight over all user feature vectors θ_p.
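
The ingredients of this procedure can be sketched as follows. The sketch is illustrative and makes two assumptions that the text does not settle: topics are compared through their word-weight vectors, and a topic-pair count is normalized by dividing by the product of the two topics' popularities. All function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_popularity(thetas, num_topics):
    """Popularity of each topic: its weight summed over all user vectors theta_p."""
    return [sum(theta[k] for theta in thetas) for k in range(num_topics)]


def top_topic_pairs(pair_counts, popularity, top_n=4):
    """Rank topic pairs by friendship count normalized by both popularities.

    pair_counts maps an unordered pair (a, b) with a <= b to its friendship count.
    Assumption: normalization divides by popularity[a] * popularity[b].
    """
    scored = {
        pair: count / (popularity[pair[0]] * popularity[pair[1]])
        for pair, count in pair_counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    popularity = topic_popularity([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]], 3)
    counts = {(0, 0): 30, (1, 1): 6, (0, 1): 3}
    print(top_topic_pairs(counts, popularity, top_n=1))
```

In the toy input, topic 0 has the most raw friendships but is also by far the most popular, so the less popular pair (1, 1) ranks first after normalization, which is the effect the normalization is meant to achieve.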

Figure 3 was constructed by these rules, with one exception: we include a Movies topic
(heading Mo (0.6%) +1.64) that lacks strong friendships, yet is positively correlated with interest
in movies. This anomaly demonstrates that interest-specific conversations do not always occur
between friends. In other words, the presence of an interest-specific conversation does not imply
the existence of friendship, which is something that text-only systems may fail to detect. In turn,
this highlights the need for holistic models like SM^4 that consider interests, conversations and
friendships jointly.

6.2 Observations and Analysis 

Common Topics Throughout these sections, we shall continually refer to Figure 3. The most
striking observation about the four interests (camping, cooking, movies, sports) is their shared
topical content, shown in the middle of the Figure. These topics represent a common lingo that
permeates Facebook, and that can be divided into two classes: "Facebook fanpages",
consisting of named entities that have pages on Facebook for users to like, and "Informal conversation
in status updates", which encompasses the most common, casual words from user status
updates.

We observe that the fanpage topic starting with "adam_sandler" is dominant, with popularity 
> 10% across all four user interest datasets. Additionally, this topic has a mild positive correlation 
with all interests, meaning that users who have any of the four interests are more likely to use this 
topic. In contrast, the fanpage topic starting with "cash" only has average popularity (between
1% and 2%) and mild negative correlation with all interests. Observe that this topic is dominated
by social gaming words ("farmville", "mafia_wars"), whereas the other, popular topic is rich in 
popular culture entities such as "Disney", "Dr Pepper", "Simpsons" and "Starbucks". This data 
provides evidence that users who exhibit any of the four interests tend to like pop culture pages 
over social gaming pages. Notably, none of these four interests are related to internet culture or 
gaming, which might explain this observation. 

The informal conversation topics are more nuanced. Notice how the topic starting with "buddy"
is both popular and strongly correlated with cooking and movies, implying that the
conversations of cooking/movie lovers differ from those of camping/sports lovers. Also, notice that the
topic starting with "beauty" is dominated by romantic words such as "boyfriend" and "girlfriend", 
and is popular/correlated only with sports — perhaps this lends some truth to the stereotype that 
school athletes lead especially active romantic lives. Finally, the topic starting with "annoy" and 
containing words such as "dad", "mom" and "house" carries a slight negative sentiment for all 
interests (in addition to being unpopular). This seems reasonable from the average teenager's 
perspective, in which parents normally have little connection with personal interests. 
High-Friendship Topics We turn to the high-friendship topics in the corners of Figure 3. Some
of these contain a high degree of self-friendships, implying that friends usually converse about 
the same topic, rather than different ones. To put it succinctly, in Facebook, the interest graph 
is correlated with the social (friendship) graph. In fact, the average proportion of same-topic 
friendships ranges from 0.2% to 0.6% depending on interest, whereas the average proportion of 
inter-topic friendships is an order of magnitude lower at 0.02% to 0.04%. Intuitively, this makes 
sense: any coherent dialogue between friends is necessarily about a single topic; multiple-topic 
conversations are hard to follow and thus rare. 
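
A comparison of this kind can be computed directly from the topic-pair assignments. The sketch below is a simplified illustration: it assumes each friendship contributes one unordered pair (s_ij, s_ji), averages raw proportions over topic pairs, and omits the popularity normalization; the function name is hypothetical and the exact averaging behind the figures above is not specified here.

```python
from collections import Counter


def friendship_proportions(assignments, num_topics):
    """Average share of friendships per same-topic pair and per inter-topic pair.

    assignments is a non-empty list of (s_ij, s_ji) topic indices, one per
    friendship; num_topics must be at least 2.
    """
    counts = Counter(tuple(sorted(pair)) for pair in assignments)
    total = len(assignments)
    same = [counts[(k, k)] / total for k in range(num_topics)]
    inter = [
        counts[(a, b)] / total
        for a in range(num_topics)
        for b in range(a + 1, num_topics)
    ]
    return sum(same) / len(same), sum(inter) / len(inter)


if __name__ == "__main__":
    toy = [(0, 0)] * 6 + [(1, 1)] * 2 + [(0, 1), (1, 0)]
    print(friendship_proportions(toy, num_topics=2))
```

Because there are far more inter-topic pairs than same-topic pairs, averaging per pair (rather than summing) is what makes the two quantities comparable.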

One interpretation of inter-topic friendships is that they signify two friends who rarely interact,
hence their conversations on the whole are topically distinct. In other words, inter-topic friendships
may represent socially weaker ties, compared to same-topic friendships. As an example,
consider the cooking topics starting with "art" and "conservative" respectively. The former topic
is about the visual arts ("design", "photography", "studio"), whereas the latter topic is about political
conservatives in America ("military", "soldier", "support"). It seems implausible that any
conversation would be about both topics, and yet there are friendships between people who talk
about either topic, though not necessarily with each other.

A second observation is that most interests have more than one positively correlated topic (with
the exception of camping). A good example is cooking: notice the topics starting with "beach"
and "beatles" respectively. The former topic has connotations of fine living, with words like "city",
"club", "travel" and "wine", whereas the latter is associated with entertainment culture, containing
phrases like "beatles", "family_guy", "pink_floyd" and "star_wars". Statistically, both topics have
much in common: moderate popularity, positive interest correlation with cooking, and a significant
proportion of self-topic friendships. Yet they are semantically different, and more importantly, do
not have a significant proportion of friendships between them. Hence, these two topics represent
separate communities of cooking lovers: one associated with the high life, the other with pop
culture. The fact that cooking lovers are not homogeneous has significant implications for policy
and advertising; a one-size-fits-all strategy is unlikely to succeed.

Similar observations can be made about sports and movies: for sports, both a television topic
("family_guy", "greys_anatomy", "espn") and an actual sports topic ("basketball", "football",
"soccer") are positively correlated with interest in sports, yet users in the former topic are likely
watching sports rather than playing them. As for movies, one topic is connected with restaurants and bars
("bar", "food", "grill", "restaurant"), while the other is connected with television ("family_guy",
"simpsons", "south_park").

Our final observation concerns the "friendliness" of users in positive topics. Notice that the
users of some positively correlated topics ("country_music" from camping, "ac_dc" from movies,
"beatles" from cooking) have plenty of within-topic friendships, yet possess almost no friendships
with other topics. In contrast, users in topics like "beach" from cooking or "beatles" from sports are
highly gregarious, readily making friends with users in other topics. The topic words themselves
may explain why: notice that the "beach" cooking topic has words like "club", "grill" and "travel"
that suggest highly social activities, while the "beatles" sports topic contains television-related
words such as "family_guy" and "espn", and television viewing is often a social activity as well.

In closing, our analysis demonstrates how a multi-modal visualization of Facebook's data can
lead to insights about network connectivity and interaction. In particular, we have seen how fanpages
and casual speech serve as a common anchor to all conversations on Facebook, how same-topic
friendships are far more common (and meaningful) than inter-topic friendships, and how
users with common interests can be heterogeneous in terms of conversation topics. We hope these
observations can inform policy directed at growing the social network, and increasing the engagement
of its users.

7 Related Work and Conclusion 

The literature contains other topic models that combine several data modalities; ours is distinguished
by the assumptions it makes. In particular, existing topic models of text and network data
either treat the network as an outcome of the text topics (RTM [8]), or define new topics for each
link in the network (ART ifTSll ). The Pairwise Link-LDA model of Nallapati et al. ifTSll is the most 
similar to ours, except (1) it does not model labels, (2) it models asymmetric links only, and crucially,
(3) its inference algorithm is infeasible for even P = 40,000 users (the size of our training
collections S'(C)) because it models all O(P^2) positive and zero links. Our model escapes this complexity
trap by only considering the positive links.

We also note that past work on Facebook' s data |fT9l used the network implicitly, by summing 
features over neighboring users. Instead, we have taken a probabilistic perspective, borrowing 
from the MMSB model [1] to cast links into the same latent topic space as the text. Thus, links
are neither a precursor to nor an outcome of the text, but equals, resulting in an intuitive scheme
where both text and links derive from specific topics. The manner in which we model the labels is
borrowed from sLDA [6], except that our links also influence the observed labels y.

In conclusion, we have tackled salient questions about user interests and friendships on Facebook,
by way of a system that combines text, network and label data to produce insightful visualizations
of the social structure generated by millions of Facebook users. Our system's key
component is a latent space model (SM^4) that learns the aggregate relationships between user text,
friendships, and interests, and this allows us to study millions of users at a macroscopic level. The
SM^4 model is closely related to the supervised text model of sLDA [6] and the network model of
MMSB [1], and combines features of both models to address our challenges. We ensure scalability
by splitting our learning algorithm into two phases: a training phase on a smaller user subset to
learn model parameters, and a parallel prediction phase that uses these parameters to predict the
most likely topic vectors θ_p for each user, as well as the most likely friendship topic-pair assignments
s_ij, s_ji for all friendships e_ij = 1. Because the prediction phase is trivially parallelizable, our
system potentially scales to all users in Facebook.